We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
Qwen3-Coder-Next is an 80-billion-parameter language model that activates only 3 billion parameters during inference, achieving strong coding capabilities through agentic training with verifiable task synthesis and reinforcement learning. It is an open-weight model specialized for coding agents, and both base and instruction-tuned versions are released to support research and real-world coding agent development.
NVIDIA GTC is the premier AI conference and exhibition. Learn about the latest advancements in AI, deep learning, and accelerated computing. Includes keynote speakers, sessions, workshops, and an exhibit hall.
This article explores how agentic AI can revolutionize deep learning experimentation by automating tasks like hyperparameter tuning, architecture search, and data augmentation. It delves into the core concepts, benefits, and practical considerations of using agentic systems to accelerate and improve the deep learning workflow.
SpiderPi Pro is an advanced hexapod robot integrated with AI vision and powered by Raspberry Pi. It features intelligent serial bus servos with a torque of 20KG, 5DOF robot arm, glowy ultrasonic sensor, IMU sensor and dot matrix module and can be programmed using Python. SpiderPi Pro serves as an ideal platform for conducting research in motion control for hexapod robots, machine vision, OpenCV, deep learning, and various other fields.
A curated reading list for those starting to learn about Large Language Models (LLMs), covering foundational concepts, practical applications, and future trends, updated for 2026.
This article explores the field of mechanistic interpretability, aiming to understand how large language models (LLMs) work internally by reverse-engineering their computations. It discusses techniques for identifying and analyzing the functions of individual neurons and circuits within these models, offering insights into their decision-making processes.
Zhipu AI has released GLM-4.7-Flash, a 30B-A3B MoE model designed for efficient local coding and agent applications. It offers strong coding and reasoning performance with a 128k token context length and supports English and Chinese.
We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute and memory constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction finetuned, and a reasoning model for complex problem-solving.
This blog post details how to implement high-performance matrix multiplication using NVIDIA cuTile, focusing on Tile loading, computation, storage, and block-level parallel programming. It also covers best practices for Tile programming and performance optimization strategies.